1. Introduction

1.1 Overview and motivation

The phenomenon of urbanization has intensified for more than a century and has led to the development of gigantic cities around the world. While this development boosts economic activity, the concentration of human activity can lead to harsher living conditions for those affected, through to the spread of air pollution. Across the pond, New York City has grown dramatically in both height and width. Following discoveries on Air Pollution, respecting the environment has become more important for New Yorkers as in 2021, New York State has enshrined the right to a healthy environment in its Constitution. To follow this, the City has been tracking air pollution parameters in all five boroughs (Manhattan, the Bronx, Brooklyn, Queens and Staten Island). By digging into all the data available on NYC (https://opendata.cityofnewyork.us), we were intrigued to see how air pollution levels can vary across the different boroughs. This data leads us to ponder the subject of how thresholds of air pollution are respected across the city as well as how inequalities in air pollution are treated.

1.3 Project objectives

The goal of this research is to assess levels of air pollution in New York City in regards to respective thresholds, given by the United States Environmental Protection Agency EPA (https://www.epa.gov/pm-pollution/national-ambient-air-quality-standards-naaqs-pm) & the WHO, isolate the main drivers of air pollution, and finally investigate the disparities according to socio-economic factors.

At first, we will investigate the air pollution levels in each UHF area of New York City over the past 11 years, to see whether the pollution standards are being met. We will then calculate the correlation between air pollution and certain characteristics of the areas (population density, density of green spaces, access to public transportation, etc.). Furthermore, since NYC has considerable socioeconomic inequalities, we will consider the financial status of the boroughs’ inhabitants to observe if those with greater financial resources are in a better position regarding air quality.

1.4 Research questions

  1. How are the thresholds of Air Pollution respected in NYC? In what Proportion?
  2. What are the main characteristics of an area contributing to Air Pollution? Do the Boroughs have distinct characteristics?
  3. Are there disparities in levels of Air Pollution between wealthier and poorer areas?

2. Data

  • Sources
  • Description
  • Wrangling/cleaning
  • Spotting mistakes and missing data

2.1 Data source

We decided to use the United Hospital Fund Neighborhoods 42 (UHF42) geographical segmentation to study our zone of interest, New York City. United Hospital Fund (UHF) 42 New York City Segmentation is a geographic and socioeconomic segmentation of New York City developed by UHF in 2019. This segmentation divides the city into 42 distinct clusters of neighborhoods based on their unique demographic characteristics. The segmentation uses a combination of data from the U.S. Census Bureau, the Department of City Planning, and other public sources to create a comprehensive view of New York City neighborhoods. The segmentation is used to help hospitals, health care providers, and other stakeholders better understand the needs of the communities they serve and improve health care delivery.

The data sets belong to : air quality, common factors of pollution and socio-demographic factors. The New York open data portal provided us with the necessary data for the first two themes. For the third, the income data comes from “Name Census : US Demographic data” and the population figures from the “Prison Police initiative” website (https://www.prisonpolicy.org/origin/ny/uhf_districts.html).

Below is a look at our data sets and their main variables.

2.2 Data description

2.2.1 Air Quality

  • Fine Particulate Matter (PM2.5): Air samples collected at specific NYCCAS monitoring sites along with information about emissions sources were incorporated into a statistical model that predicted pollutant concentrations at all locations in NYC for the specified time period. In NYC, fine particulate matter is measured in units of micrograms per cubic metre of air.

  • Time: Time period of Air samples collection.

  • GeoID: Numeric code of geographic area.

  • GeoType: Geography type. There are several methods of delimitation of the geographical area. The most used are UHF42, UHF34, CD and Borough.

  • GeoRank: Geography Rank.

  • Geography: Name of the geography area according to their GeoType.

  • Mean mcg per cubic metre: The results were then assigned to the appropriate NYC neighbourhood and averaged

  • 0th Percentile, mcg per cubic metre : The results were then assigned to the appropriate NYC neighbourhood and the 10th percentile identified.

  • 90th Percentile, mcg per cubic metre : The results were then assigned to the appropriate NYC neighbourhood and the 90th percentile identified.

(data source:https://a816-dohbesp.nyc.gov/IndicatorPublic/beta/data-explorer/air-quality/?id=2023#display=summary)

2.2.2 Common factors

As opposed to the PM2.5 data, we don’t have continuous data for every year for our common factors. We felt they are still relevant for our analysis as these values do not tend to change drastically from one year to another, especially in such a developed city such as New York. For example, subway density variations are quite rare as the space for public transport is extremely congested and there are close to zero opportunities to build new subway stops. Similarly, vegetative cover will not change in a significant way as they city’s parks have been established for years. Later in the analysis, we will be using the year 2017 as a reference point.

  • Vegetative Cover: Vegetative cover (2017) is the land covered by trees, grass, or other plants. This area is not covered by a hard surface such as paved roads, sidewalks, or buildings. Vegetative cover tends to reduce temperatures in the immediate area and may increase air quality.

  • Time: Time period.

  • GeoID: Numeric code of geographic area.

  • GeoType: Geography type. There are several methods of delimitation of the geographical area. The most used are UHF42, UHF34, CD and Borough.

  • GeoRank: Geography Rank.

  • Geography: Name of the geography area according to their GeoType.

  • Percent: Percentage of vegetative cover

(data source: https://a816-dohbesp.nyc.gov/IndicatorPublic/VisualizationData.aspx?id=2143,719b87,107,Summarize)

  • Walking distance to a subway station : The percentage of population within a quarter-mile of a subway station entrance. When more people have convenient access to the transit system, they’re more likely to use it. Subway access encourages active transportation, which improves the health of residents. This data is for 2018.

  • Time: Time period.

  • GeoID: Numeric code of geographic area.

  • GeoType: Geography type. There are several methods of delimitation of the geographical area. The most used are UHF42, UHF34, CD and Borough.

  • GeoRank: Geography Rank.

  • Geography: Name of the geography area according to their GeoType.

  • Percent: Census block populations within areas defined as walking distance were summed across the neighborhood and divided by total neighborhood population.

(data source: https://a816-dohbesp.nyc.gov/IndicatorPublic/VisualizationData.aspx?id=2391,719b87,104,Summarize)

  • Subway density: Neighborhoods with greater subway access encourage active transportation. This data is for 2014.

  • Time: Time period.

  • GeoID: Numeric code of geographic area.

  • GeoType: Geography type. There are several methods of delimitation of the geographical area. The most used are UHF42, UHF34, CD and Borough.

  • GeoRank: Geography Rank.

  • Geography: Name of the geography area according to their GeoType.

  • Density: The count of MTA subways stations as of 2012 divided by the total land area in km2 of the UHF neighborhood.

(data source: https://a816-dohbesp.nyc.gov/IndicatorPublic/VisualizationData.aspx?id=2158,719b87,104,Summarize)

  • Bicycle network: Percent of Streets with Bicycle Lanes. This data is from 2017.

  • Time: Time period.

  • GeoID: Numeric code of geographic area.

  • GeoType: Geography type. There are several different methods of delimitation of the geographical area. The most used are UHF42, UHF34, CD and Borough.

  • GeoRank: Geography Rank.

  • Geography: Name of the geography area according to their GeoType.

  • Percent: ratio of streets with bicycle area against streets without bicycle area.

(data source: https://a816-dohbesp.nyc.gov/IndicatorPublic/VisualizationData.aspx?id=2390,719b87,104,Summarize)

  • Traffic density: A measure of the average number of vehicles that occupy a specified area. Traffic density can influence health as a source of air pollution. This data is from 2016.

  • Time: Time period.

  • GeoID: Numeric code of geographic area.

  • GeoType: Geography type. There are several methods of delimitation of the geographical area. The most used are UHF42, UHF34, CD and Borough.

  • GeoRank: Geography Rank.

  • Geography: Name of the geography area according to their GeoType.

  • Million miles (per km2): Vehicle-miles are expressed in millions per square kilometer.

(data source: https://a816-dohbesp.nyc.gov/IndicatorPublic/VisualizationData.aspx?id=2112,719b87,114,Summarize)

2.2.3 Socio-demographics factors

  • Income per capita: This data set provides information about the median household income in New York City by UHF area. These incomes levels are for 2022.

  • GeoID: Numeric code of the UHFgeographic area.

  • MedianIncome : Median Household Income of given UHF area.

(data source: 10001 Zip Code Income, Population and Demographics)

  • Population Density: This data set provides us with the population per square in mile in New York City.

  • GeoID: Numeric code of geographic area.

  • Popcount : Population count

  • Km2: Size of area in square kilometers.

  • Popdensity: Result of Popcount/Km2 for a given area.

(data source: https://www.prisonpolicy.org/origin/ny/uhf_districts.html)

  • Neighborhood poverty: The percent of households with incomes below the federal poverty level. This data is an average of the annual values between 2015 and 2019.

  • Time: Time period.

  • GeoID: Numeric code of geographic area.

  • GeoType: Geography type. There are several methods of delimitation of the geographical area. The most used are UHF42, UHF34, CD and Borough.

  • GeoRank: Geography Rank.

  • Geography: Name of the geography area according to their GeoType.

  • Number: Estimated number of people for whom poverty status is determined, whose annual income falls below 100% of the federal poverty level.

  • Percent: Estimated number of people for whom poverty status is determined, whose annual income falls below 100% of the federal poverty level, divided by the number of people for whom poverty status is determined; expressed as percent.

(data source: https://a816-dohbesp.nyc.gov/IndicatorPublic/beta/data-explorer/economic-conditions/?id=103#display=links)

2.3 Data Import and cleaning

By looking at our data description, we can see that there are a lot of similar variables between all of our data sets. For example, the Geographical names of the units, as well as their Geographical ID are the same. It is interesting as we will be able to merge all of the data sets to create an harmonized dataframe later on. In order to do that, we start by importing all the datasets. We decided to only keep the following variables going forward: Time, Geography, GeoID and the unique values from each data set. For the data of Income and Population density, there was no available data and we had to get creative by merging other geographic segmentations and converting them to our UHF42 standard.

2.3.1 Air Quality

We begin by importing our excel file “PM2.5”. Since we are only interested in the UHF42 values, we are going to filter our rows accordingly. We also decided to only keep the “annual average values” for the year 2010 to 2021 in the time column. This way, we will be able to show how air pollution has changed in the city during the last decade. Below are the 5 first rows of the data set.

# Loading the csv file
pm2.5 <- read.csv(file = here::here("data/PM2.5.csv"))

# Filtering rows based on UHF42 and years 2010-2021
newpm2.5 <- dplyr::filter(pm2.5, GeoType == "UHF42" & Time %in% c("Annual Average 2021","Annual Average 2020","Annual Average 2019", "Annual Average 2018", "Annual Average 2017","Annual Average 2016","Annual Average 2015","Annual Average 2014","Annual Average 2013","Annual Average 2012","Annual Average 2011","Annual Average 2010"))

# Selecting appropriate variables
dfpm2.5 <- dplyr::select(newpm2.5, Time, GeoID, Geography, Mean.mcg.m3)

# Displaying first 5 rows in a table
kable((dfpm2.5[1:5,]),"html") %>%
  kable_styling(full_width = F)
Time GeoID Geography Mean.mcg.m3
Annual Average 2021 101 Kingsbridge - Riverdale 6.57
Annual Average 2021 102 Northeast Bronx 6.49
Annual Average 2021 103 Fordham - Bronx Pk 6.58
Annual Average 2021 104 Pelham - Throgs Neck 6.60
Annual Average 2021 105 Crotona -Tremont 6.79

2.3.2 Common Factors

We pursue with the same logic for the common factors.

Vegetative Cover

# Loading the csv file
vegetative_cover <- read.csv(file = here::here("data/Vegetative_Cover.csv"))

# Selecting appropriate variables
dfvegetative_cover <- dplyr::select(vegetative_cover, Time, GeoID, Geography, Percent)

# Changing the name of column "percent" to "VegePercent"
colnames(dfvegetative_cover)[4] = "VegePercent"

# Displaying first 5 rows in a table
kable((dfvegetative_cover[1:5,]),"html") %>%
  kable_styling(full_width = F)
Time GeoID Geography VegePercent
2017 101 Kingsbridge - Riverdale 63
2017 102 Northeast Bronx 36
2017 103 Fordham - Bronx Pk 40
2017 104 Pelham - Throgs Neck 46
2017 105 Crotona -Tremont 32

Walking distance to a subway station

# Loading the csv file
walk_dist_subway <- read.csv(file = here::here("data/Walking_Distance_Subway.csv"))

# Selecting appropriate variables
dfwalk_dist_subway <- dplyr::select(walk_dist_subway, Time, GeoID, Geography, Percent)

# Changing the name of column "percent" to "WalkSubDist"
colnames(dfwalk_dist_subway)[4] = "WalkSubDist"

# Displaying first 5 rows in a table
kable((dfwalk_dist_subway[1:5,]),"html") %>%
  kable_styling(full_width = F)
Time GeoID Geography WalkSubDist
2018 101 Kingsbridge - Riverdale 33
2018 102 Northeast Bronx 25
2018 103 Fordham - Bronx Pk 69
2018 104 Pelham - Throgs Neck 32
2018 105 Crotona -Tremont 54

Subway density

# Loading the csv file
subway_density <- read.csv(file = here::here("data/Subway_Density.csv"))

# Selecting appropriate variables
dfsubway_density <- dplyr::select(subway_density, Time, GeoID, Geography, Density)

# Changing the name of column "percent" to "SubdDensity"
colnames(dfsubway_density)[4] = "SubDensity"

# Displaying first 5 rows in a table
kable((dfsubway_density[1:5,]),"html") %>%
  kable_styling(full_width = F)
Time GeoID Geography SubDensity
2014 101 Kingsbridge - Riverdale 0.330
2014 102 Northeast Bronx 0.474
2014 103 Fordham - Bronx Pk 1.110
2014 104 Pelham - Throgs Neck 0.353
2014 105 Crotona -Tremont 0.920

Bicycle network

# Loading the csv file
bicycle_network <- read.csv(file = here::here("data/Bicycle_Network.csv"))

# Selecting appropriate variables
dfbicycle_network <- dplyr::select(bicycle_network, Time, GeoID, Geography, Percent)

# Changing the name of column "percent" to "BicPercent"
colnames(dfbicycle_network)[4] = "BicPercent"

# Displaying first 5 rows in a table
kable((dfbicycle_network[1:5,]),"html") %>%
  kable_styling(full_width = F)
Time GeoID Geography BicPercent
2017 101 Kingsbridge - Riverdale 7
2017 102 Northeast Bronx 9
2017 103 Fordham - Bronx Pk 12
2017 104 Pelham - Throgs Neck 8
2017 105 Crotona -Tremont 14

Traffic density

# Loading the csv file
traffic_density <- read.csv(file = here::here("data/Traffic_Density.csv"))

# Selecting appropriate variables
dftraffic_density <- dplyr::select(traffic_density, Time, GeoID, Geography, TrafficDensity)

# Changing the name of column "percent" to "TrafDensity"
colnames(dftraffic_density)[4] = "TrafDensity"

# Displaying first 5 rows in a table
kable((dftraffic_density[1:5,]),"html") %>%
  kable_styling(full_width = F)
Time GeoID Geography TrafDensity
2016 101 Kingsbridge - Riverdale 28.8
2016 102 Northeast Bronx 17.1
2016 103 Fordham - Bronx Pk 21.5
2016 104 Pelham - Throgs Neck 23.9
2016 105 Crotona -Tremont 36.4

2.3.3 Socio-demographics factors

Population Density

In order to calculate the population density, we need two elements: population count and the square kilometers of each UHF zone. On the Prison Policy initiative website, we found a table giving the total population per UHF code, but there wasn’t a possibility of downloading a csv file. We decided to choose the data scraping method to retrieve these vital pieces of data. To do so, we use the rvest package to data scrape the website. We create the html variable and assign it to the website, and using a CSS selector, we are able to select the data we want. We select the 84 elements corresponding to the sum of UHF codes and population counts, and create a population variable.

library(rvest)

html <- read_html("https://www.prisonpolicy.org/origin/ny/uhf_districts.html")

population <- html %>% 
  html_elements("td:nth-child(1) , td:nth-child(6)") %>% 
  html_text2() %>% 
  .[1:84]

Our result is now located in a vector and we need to convert it to a dataframe. First, we sequence along even and odd numbers, based on index of vectors to create seperated vectors containing UHF code and population count. Then, we create a dataframe and rename the columns. Now, we convert character columns into numeric. We also remove the commas in the Popcount column.

n <- length(population)
popuhfcode <- c(population[seq(n) %% 2 == 1])
popcount <- c(population[seq(n) %% 2 == 0])

dfpop<- data.frame(popuhfcode,popcount)

colnames(dfpop)[1] = "GeoID"
colnames(dfpop)[2] = "Popcount"

# Now we convert character columns into numeric. We also remove the comas in the column Popcount
dfpop$GeoID <- as.numeric(dfpop$GeoID)
dfpop$Popcount <- as.numeric(sub(",", "", dfpop$Popcount,fixed = TRUE))

For the Population density of each UHF code, we use the same techniques. However, we face another difficulty: we only have the values per Zip Code, not by UHF42 areas. We initially tried doing things manually by using a Map and writing the conversion by hand, but this was extremely inaccurate and we need to reach precise results. We wrote an email to the NYC open data, who nicely sent us a conversation table for transforming Zip Codes into the respective UHF areas.

html2 <- read_html("https://namecensus.com/zip-codes/new-york/")

squaremeter <- html2 %>% 
  html_elements("td:nth-child(4) , td:nth-child(1)") %>% 
  html_text2() %>% 
  .[1:2580]

n <- length(squaremeter)
zipcode <- c(squaremeter[seq(n) %% 2 == 1])
sqmeter <- c(squaremeter[seq(n) %% 2 == 0])
uhfsize <- data.frame(zipcode,sqmeter)

uhfsize$zipcode <- as.numeric(uhfsize$zipcode)
uhfsize$sqmeter<- as.numeric(gsub(",", "", uhfsize$sqmeter, fixed = TRUE))

# We convert the square meters to KM2
uhfsize$sqmeter<- as.numeric(format(as.numeric(uhfsize$sqmeter) / 1000000))

In order to filter the rows, we have to define each Zip Code based on their UHF

"UHF101" <- c(10463,10471)
"UHF102" <- c(10466,10469,10470,10475)
"UHF103" <- c(10458,10467,10468)
"UHF104" <- c(10461,10462,10464,10465,10472,10473)
"UHF105" <- c(10453,10457,10460)
"UHF106" <- c(10451,10452,10456)
"UHF107"<- c(10454,10455,10459,10474)
"UHF201"<- c(11211,11222)
"UHF202"<- c(11201,11205,11215,11217,11231)
"UHF203"<- c(11213,11212,11216,11233,11238)
"UHF204"<- c(11207,11208)
"UHF205"<- c(11220,11232)
"UHF206" <- c(11204,11218,11219,11230)
"UHF207"<- c(11203,11210,11225,11226)
"UHF208"<- c(11234,11236,11239)
"UHF209"<- c(11209,11214,11228)
"UHF210"<- c(11223,11224,11229,11235)
"UHF211"<- c(11206,11221,11237)
"UHF301"<- c(10031,10032,10033,10034,10040)
"UHF302"<- c(10026,10027,10030,10037,10039)
"UHF303"<- c(10029,10035)
"UHF304"<- c(10023,10024,10025)
"UHF305"<- c(10021,10028,10044,10128)
"UHF306"<- c(10001,10011,10018,10019,10020,10036)
"UHF307"<- c(10010,10016,10017,10022)
"UHF308"<- c(10012,10013,10014)
"UHF309"<- c(10002,10003,10009)
"UHF310"<- c(10004,10005,10006,10007,10038,10280)
"UHF401"<- c(11101,11102,11103,11104,11105,11106)
"UHF402"<- c(11368,11369,11370,11372,11373,11377,11378)
"UHF403"<- c(11354,11355,11356,11357,11358,11359,11360)
"UHF404"<- c(11361,11362,11363,11364)
"UHF405"<- c(11374,11375,11379,11385)
"UHF406"<- c(11365,11366,11367)
"UHF407"<- c(11414,11415,11416,11417,11418,11419,11420,11421)
"UHF408"<- c(11412,11423,11432,11433,11434,11435,11436)
"UHF409"<- c(11004,11005,11411,11413,11422,11426,11427,11428,11429)
"UHF410"<- c(11691,11692,11693,11694,11695,11697)
"UHF501"<- c(10302,10303,10310)
"UHF502"<- c(10301,10304,10305)
"UHF503"<- c(10314)
"UHF504"<- c(10306,10307,10308,10309,10312)

Now, we perform filtering and keep only relevant Zip Codes.

uhfselected <- uhfsize %>% filter(uhfsize$zipcode %in% c(10463,10471,10466,10469,10470,10475,10458,10467,10468,10461,10462,10464,10465,10472,10473,10453,10457,10460,10451,10452,10456,10454,10455,10459,10474,11211,11222,11201,11205,11215,11217,11231,11213,11212,11216,11233,11238,11207,11208,11220,11232,11204,11218,11219,11230,11203,11210,11225,11226,11234,11236,11239,11209,11214,11228,11223,11224,11229,11235,11206,11221,11237,10031,10032,10033,10034,10040,10026,10027,10030,10037,10039,10029,10035,10023,10024,10025,10021,10028,10044,10128,10001,10011,10018,10019,10020,10036,10010,10016,10017,10022,10012,10013,10014,10002,10003,10009,10004,10005,10006,10007,10038,10280,11101,11102,11103,11104,11105,11106,11368,11369,11370,11372,11373,11377,11378,11354,11355,11356,11357,11358,11359,11360,11361,11362,11363,11364,11374,11375,11379,11385,11365,11366,11367,11414,11415,11416,11417,11418,11419,11420,11421,11412,11423,11432,11433,11434,11435,11436,11004,11005,11411,11413,11422,11426,11427,11428,11429,11691,11692,11693,11694,11695,11697,10302,10303,10310,10301,10304,10305,10314,10306,10307,10308,10309,10312))

uhfselected2 <- uhfselected %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF101) ,101)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF102) ,102)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF103) ,103)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF104) ,104)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF105) ,105)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF106) ,106)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF107) ,107)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF201) ,201)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF202) ,202)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF203) ,203)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF204) ,204)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF205) ,205)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF206) ,206)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF207) ,207)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF208) ,208)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF209) ,209)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF210) ,210)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF211) ,211)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF301) ,301)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF302) ,302)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF303) ,303)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF304) ,304)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF305) ,305)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF306) ,306)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF307) ,307)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF308) ,308)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF309) ,309)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF310) ,310)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF401) ,401)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF402) ,402)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF403) ,403)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF404) ,404)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF405) ,405)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF406) ,406)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF407) ,407)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF408) ,408)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF409) ,409)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF410) ,410)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF501) ,501)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF502) ,502)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF503) ,503)) %>%
  as.data.frame() %>%
  mutate(zipcode=replace(zipcode, zipcode %in% (UHF504) ,504)) %>%
  as.data.frame() 

Here, we group by Zip Code and create our square kilometers column. Then, we convert to dataframe and rename the GeoID column. Finally, we create a new datafram with our two tables and add new column for population density.

uhfselected3 <-uhfselected2  %>% group_by(zipcode) %>% 
  summarise(Km2 = sum(sqmeter),
            .groups = 'drop')

dfuhfsize <- uhfselected3 %>% as.data.frame()
colnames(dfuhfsize)[1] <- "GeoID"

dfpop_density <- merge(dfpop,dfuhfsize, by="GeoID")
dfpop_density <- mutate(dfpop_density, Popdensity = Popcount/Km2)

dfpop_density <- round(dfpop_density)

kable((dfpop_density[1:5,]),"html") %>%
  kable_styling(full_width = F)
GeoID Popcount Km2 Popdensity
101 94095 10 9013
102 190315 20 9617
103 250249 11 21894
104 297059 36 8168
105 206630 9 24224

Median Income per household

Using the conversion table for transforming Zip Codes, we create the GeoID and Zipcode variables. Last but not least, we had to enter the income data by hand, one zip code after another. We didn’t manage to find an available csv file or a website to scrape. We had to click on each Zip Code individually and enter them into a new vector: Income.


GeoID <- c(101,101,102,102,102,102,103,103,103,104,104,104,104,104,104,105,105,105,106,106,106,107,107,107,107,201,201,202,202,202,202,202,203,203,203,203,203,204,204,205,205,206,206,206,206,207,207,207,207,208,208,208,209,209,209,210,210,210,210,211,211,211,301,301,301,301,301,302,302,302,302,302,303,303,304,304,304,305,305,305,305,306,306,306,306,306,306,307,307,307,307,308,308,308,309,309,309,310,310,310,310,310,310,401,401,401,401,401,401,402,402,402,402,402,402,402,403,403,403,403,403,403,403,404,404,404,404,405,405,405,405,406,406,406,407,407,407,407,407,407,407,407,408,408,408,408,408,408,408,409,409,409,409,409,409,409,409,409,410,410,410,410,410,410,501,501,501,502,502,502,503,504,504,504,504,504)

Zipcode <- c(10463,10471,10466,10469,10470,10475,10458,10467,10468,10461,10462,10464,10465,10472,10473,10453,10457,10460,10451,10452,10456,10454,10455,10459,10474,11211,11222,11201,11205,11215,11217,11231,11213,11212,11216,11233,11238,11207,11208,11220,11232,11204,11218,11219,11230,11203,11210,11225,11226,11234,11236,11239,11209,11214,11228,11223,11224,11229,11235,11206,11221,11237,10031,10032,10033,10034,10040,10026,10027,10030,10037,10039,10029,10035,10023,10024,10025,10021,10028,10044,10128,10001,10011,10018,10019,10020,10036,10010,10016,10017,10022,10012,10013,10014,10002,10003,10009,10004,10005,10006,10007,10038,10280,11101,11102,11103,11104,11105,11106,11368,11369,11370,11372,11373,11377,11378,11354,11355,11356,11357,11358,11359,11360,11361,11362,11363,11364,11374,11375,11379,11385,11365,11366,11367,11414,11415,11416,11417,11418,11419,11420,11421,11412,11423,11432,11433,11434,11435,11436,11004,11005,11411,11413,11422,11426,11427,11428,11429,11691,11692,11693,11694,11695,11697,10302,10303,10310,10301,10304,10305,10314,10306,10307,10308,10309,10312)

Income <- c(60397.00,93657.00,58405.00,67190.00,64643.00,53819.00,37886.00,40639.00,37777.00,60802.00,57266.00,101935.00,74889.00,36038.00,47856.00,31778.00,32378.00,28831.00,32598.00,31351.00,31645.00,23337.00,29439.00,31896.00,28419.00,90871.00,96435.00,139697.00,62952.00,144330.00,122598.00,108525.00,48318.00,29385.00,77343.00,48278.00,100568.00,42276.00,45946.00,55119.00,74072.00,54984.00,78249.00,41907.00,57770.00,57707.00,70776.00,65200.00,60896.00,84116.00,74015.00,29571.00,79524.00,57639.00,76278.00,56908.00,35776.00,66332.00,56308.00,43065.00,65458.00,57375.00,55721.00,53690.00,66902.00,63556.00,53199.00,64716.00,57010.00,45555.00,50686.00,41363.00,33801.00,28408.00,137347.00,137126.00,99067.00,130968.00,131013.00,112658.00,117926.00,96787.00,136208.00,136360.00,101651.00,0.00,97720.00,139188.00,124647.00,130417.00,138661.00,110490.00,130675.00,136154.00,35607.00,129981.00,68220.00,204949.00,184681.00,185268.00,250001.00,95640.00,170385.00,124949.00,81462.00,81677.00,75772.00,86516.00,71865.00,56904.00,61860.00,67582.00,61239.00,56835.00,59910.00,82326.00,45838.00,45527.00,64324.00,82858.00,70777.00,0.00,84356.00,86545.00,92212.00,111094.00,87135.00,75266.00,88117.00,89700.00,77044.00,70102.00,91795.00,64332.00,75599.00,75476.00,73471.00,79210.00,73856.00,79589.00,85190.00,77738.00,85235.00,70990.00,66270.00,62453.00,65845.00,69873.00,72962.00,97845.0,75742.00,104269.00,91836.00,93397.00,102770.00,81688.00,77466.00,82532.00,52605.00,53077.00,52946.00,87755.00,0.00,116205.00,72394.00,70315.00,86895.00,70065.00,64394.00,77136.00,90306.00,83309.00,108808.00,110478.00,102730.00,96785.00)

With these three vectors, we create a dataframe income and group it by GeoID.

income <- data.frame(GeoID, Zipcode, Income)

# Group by GeoID
dfincome <- income  %>% group_by(GeoID) %>% 
  summarise(MedianIncome = (sum(Income)/(length(GeoID))),
            .groups = 'drop')

dfincome <- round(dfincome)

kable((dfincome[1:5,]),"html") %>%
  kable_styling(full_width = F)
GeoID MedianIncome
101 77027
102 61014
103 38767
104 63131
105 30996

Neighborhood poverty

# Loading the csv file
Neighborhood_Poverty <- read.csv(file = here::here("data/Neighborhood_Poverty.csv"))

# Selecting appropriate variables
dfNeighborhood_Poverty <- dplyr::select(Neighborhood_Poverty, GeoID,Geography,Percent)

# Changing the name of column "percent" to "NeighPercent"
colnames(dfNeighborhood_Poverty)[3] = "NeighPercent"

# Displaying first 5 rows in a table
kable((dfNeighborhood_Poverty[1:5,]),"html") %>%
  kable_styling(full_width = F)
GeoID Geography NeighPercent
101 Kingsbridge - Riverdale 15.1
102 Northeast Bronx 15.3
103 Fordham - Bronx Pk 29.4
104 Pelham - Throgs Neck 21.6
105 Crotona -Tremont 37.0

Now, we merge all the tables into a single dataframe, and we remove the column year. We keep the PM2.5 in a seperate table for now, as we will start our analysis on it’s evolution through the last decade.

Factors <- (list(dfvegetative_cover[,-1],dfwalk_dist_subway[,c(2,4)],dfbicycle_network[,c(2,4)],dfNeighborhood_Poverty[,c(1,3)], dfsubway_density[,c(2,4)], dftraffic_density[,c(2,4)],dfincome[,c(1,2)],dfpop_density[,c(1,4)]) %>% reduce(inner_join, by='GeoID'))

library(DT)

datatable(
  Factors,
  options = list(
    columnDefs = list(list(className = 'dt-center', targets = 5)),
    scrollX = TRUE,
    pageLength = 10,
    lengthMenu = c(10,20,30,40,42)
  )
)

3. Exploratory Data Analysis

3.1 Mapping the UHF

As we’re working on tangible geographical values, it’s essential to plot out our variables on a map. For this, we discovered the shapefile format and learnt how to manipulate it. On the NYC open data website, we found a shapefile mapping out all 42 UHF areas(https://www.nyc.gov/site/doh/data/data-sets/maps-gis-data-files-for-download.page : “Other GIS DATA”, second link(zip file)). Using the tmap package, we can see below a map of NYC according to the 42 UHF areas.

library("rgdal")
library("tmap")

# Import geographic file in a variable
UHFmapfile <- readOGR(dsn = here::here("spatial/UHF_42_DOHMH_2009"),layer="UHF_42_DOHMH_2009")
UHFmap <- UHFmapfile[UHFmapfile$OBJECTID > 1,]

# Plot the map
# Simple plot
tm_shape(UHFmap)+
  tm_polygons(col = "white",
              title = "UHFCODE")

# interactive map 10 years with PM2.5
# Rename from categorical to numerical Time column

dfpm2.5$Time <- str_replace(dfpm2.5$Time, "Annual Average", "")
dfpm2.5$Time <- as.numeric(dfpm2.5$Time)

# rename column in pm2.5 to match spf name
colnames(dfpm2.5)[2] = "UHFCODE"

# merge variables from Factors to UHFmap
uhfpm <- merge(UHFmap, dfpm2.5, by = "UHFCODE", duplicateGeoms = TRUE)
# load the viridis package for dynamic graph
library(viridis)

Here is a representation of NYC with the detail of each UHF area by number.

tm_shape(UHFmap) +
  tm_polygons(col = "white",
              title = "UHFCODE") + 
    tm_text("UHFCODE", size = 0.5)

To better understand the geographical position of each borough, we’ve created this map and color coded each part of NYC.

# Map NY UHF to Borough

# Connecting geographic info with data frame "Factors"
# rename column UHFCode to match spf name
colnames(Factors)[1] = "UHFCODE"

# merge variables from Factors to UHFmap
testshape <- merge(UHFmap, Factors, by = "UHFCODE")

tm_shape(testshape) +
  tm_polygons(col = "UHFCODE", 
              style = "fixed",
              palette = "Pastel1",
              breaks = c(0, 108, 212, 311, 411,504),
              labels = c("Bronx", "Brooklyn","Manhattan","Queens", "Staten Island"))

3.2 Mapping the evolution of PM2.5 from 2010 to 2021

With this shapefile, we’ve created a new visualisation called uhfpm, which we’ve connected to the original shapefile with the “UHFCODE” variable. Thanks to this, we can now use our dfpm2.5 table.

Then, we did the same with our Factors table and have created the testshape table. We have decided to divide the data into two tables because our PM2.5 values are traced over 11 years, while our Factors table contains data only for one year. Now, we can regroup all UHF Codes to their respective boroughs and map them out on the original shapefile. For precision, we can note that Central Park and JFK Airport were left out of the intial shapefile, hence the large white rectangle in the middle of Manhattan.

# Create dataframe to link factors with pm2.5 values for 2017 as reference point
# start by renaming GeoID from factors to UHFCODE
colnames(Factors)[1] = "UHFCODE"
testpm2017 <- filter(dfpm2.5, Time == 2017)
head(testpm2017)
#>   Time UHFCODE                Geography Mean.mcg.m3
#> 1 2017     101  Kingsbridge - Riverdale        7.41
#> 2 2017     102          Northeast Bronx        7.46
#> 3 2017     103       Fordham - Bronx Pk        7.52
#> 4 2017     104     Pelham - Throgs Neck        7.52
#> 5 2017     105         Crotona -Tremont        8.16
#> 6 2017     106 High Bridge - Morrisania        8.16
pm2017 <- dplyr::select(testpm2017,UHFCODE,Mean.mcg.m3)
testFactors_and_PM <- merge(pm2017,Factors, by='UHFCODE')

# Remove Geography column
Factors_and_PM <- testFactors_and_PM[,-3]

Going further with the use of the shapefile, we use uhfpm to show the evolution of PM2.5 between the years 2010 to 2021. Through this visualization, we saw a decrease in Air pollution levels across all five boroughs of the city. We also find that there are two UHF areas, in the center of Manhattan which have decreased a lot less than the neighbouring areas. Drawing on this, we emit the hypothesis that this area probably has a much higher level of traffic density and lower vegetative cover on average.

# Multiple maps per year
tm_shape(uhfpm) +
  tm_polygons(
    col = "Mean.mcg.m3",
    style = "cont",
    pal = viridis(10, direction = -1),
    ) +
  tm_facets(by = "Time", ncol = 3) +
  tm_layout(legend.outside.size = 0.1)

3.3 Animation of the evolution of PM2.5 from 2010 to 2021

For a better visualization of this evolution, we track the years by using a GIF.

# Animated graphic
library(gifski)
library(tmap)
library(viridis)
library(installr)
library(magick)
library(caTools)
#installr::install.imagemagick()
#devtools::install_github('talgalili/installr') 

pf2.5_animation <- tm_shape(uhfpm) +
    tm_text("UHFCODE", size = 0.6 )+
    tm_polygons(
    col = "Mean.mcg.m3",
    style = "cont",
    pal = viridis(10, direction = -1)
    ) +
  tm_facets(along = "Time") +
  tm_layout(legend.position = c("right", "bottom"),title= 'Mean average PM2.5 evolution',title.size = 1.5,legend.title.size = 1, legend.text.size = 1,  title.position = c('left', 'top'))

tmap_animation(pf2.5_animation,filename = "animation.gif",delay = 90)
#> Creating frames
#> ============
#> =====
#> ======
#> ======
#> =====
#> ======
#> ======
#> ======
#> ======
#> =====
#> ======
#> 
#> Creating animation
#> Animation saved to /Users/julianchanganaqui/Desktop/dsiba_project_groupJ/report/animation.gif

We can also visualise this decrease with the evolution of each UHF areas:

library(purrr)
library(ggplot2)
library(gganimate)

dfpm2.5 %>%
  ggplot( aes(x=dfpm2.5$Time, y=dfpm2.5$Mean.mcg.m3, group=UHFCODE, color=UHFCODE)) +
    geom_line() +
    geom_point() +
    scale_color_viridis(discrete = FALSE) +
    ggtitle("Pm2.5 evolution per UHF code")  +
    ylab("Mean mcg m3") + xlab("Time")+
    transition_reveal(dfpm2.5$Time)
#> Warning: Use of `dfpm2.5$Time` is discouraged.
#> i Use `Time` instead.
#> Warning: Use of `dfpm2.5$Mean.mcg.m3` is discouraged.
#> i Use `Mean.mcg.m3` instead.
#> Warning: Use of `dfpm2.5$Time` is discouraged.
#> i Use `Time` instead.
#> Warning: Use of `dfpm2.5$Mean.mcg.m3` is discouraged.
#> i Use `Mean.mcg.m3` instead.

#Groups income into 3 levels

interval_income <- cut_interval(Factors_and_PM$MedianIncome,3)
test <- table(Factors_and_PM$UHFCODE, interval_income)

Factors_and_PM_Income_Level <- Factors_and_PM %>%
  mutate(cut_interval(Factors_and_PM$MedianIncome,3)) 

colnames(Factors_and_PM_Income_Level)[11] = "Category"
Factors_and_PM_Income_Level$Category <- as.character(Factors_and_PM_Income_Level$Category)

Factors_and_PM_Income_Level[Factors_and_PM_Income_Level == "[2.83e+04,7.95e+04]"] <- "Low"
Factors_and_PM_Income_Level[Factors_and_PM_Income_Level == "(7.95e+04,1.31e+05]"] <- "Medium"
Factors_and_PM_Income_Level[Factors_and_PM_Income_Level == "(1.31e+05,1.82e+05]"] <- "High"

3.4 Mapping the different factors

We start by mapping out our Vegetative Cover data. This figure gives us the percentage of greenery of a given area. We can see that Manhattan has the lowest values and this perhaps explains why the PM2.5 levels decreased less over the years. In the Bronx, Staten Island and East Queens, we have UHF zones with over 50% of vegetative cover.

# Graph of Vegetative Cover per UHF Code
tm_shape(testshape) +
  tm_polygons(col = "VegePercent",
              style = "quantile",
              title = "") +
tm_layout(main.title = "Vegetative Percentage", main.title.size = 1,
title.position = c("right", "top")) + 
    tm_text("VegePercent", size = 0.5)

Next, we map out the Walking Subway Distance Percentage: The percentage of population within a quarter-mile of a subway station entrance. As predicted, the center of NYC has the highest values while the less urban areas have lower values, even 0.

tm_shape(testshape) +
  tm_polygons(col = "WalkSubDist",
              style = "quantile",
              title = "") +
tm_layout(main.title = "Walking Subway Distance Percentage", main.title.size = 1, 
title.position = c("left", "top")) + 
    tm_text("WalkSubDist", size = 0.5)

Our third map shows us the Bicyle Road Percentage: Percent of Streets with Bicycle Lanes. We see that Manhattan has double digit values for all its UHF codes while the other boroughs have clearly less bike lanes.

tm_shape(testshape) +
  tm_polygons(col = "BicPercent",
              style = "quantile",
              title = "") +
tm_layout(main.title = "Bicycle Road Percentage", main.title.size = 1, 
title.position = c("right", "top")) + 
    tm_text("BicPercent", size = 0.5)

Next, we plot the Subway Density. This represents the count of subways stations divided by the total land area in km2 of the UHF neighborhood. Similar to the Subway Walking Distance data, Manhattan & Brooklyn have higher values while the more rural areas are left out.

tm_shape(testshape) +
  tm_polygons(col = "SubDensity",
              style = "quantile",
              title = "") +
tm_layout(main.title = "Subway Density", main.title.size = 1, 
title.position = c("right", "top"))

Regarding Median Income, we can clearly see Manhattan residents have a better financial situation than the rest of NYC. Between the Upper East Side & East Harlem, the median Income is divided by four. In other boroughs, we have mixed results of poor, middle class and rich UHF areas.

tm_shape(testshape) +
  tm_polygons(col = "MedianIncome",
              style = "quantile",
              title = "") +
tm_layout(main.title = "Median Income", main.title.size = 1, 
title.position = c("right", "top"))

Last but not least, we map the Population Density per Km2 and find that Manhattan has the highest concentrations of population.

tm_shape(testshape) +
  tm_polygons(col = "Popdensity",
              style = "quantile",
              title = "") +
tm_layout(main.title = "Population Density (Per Km2)", main.title.size = 1, 
title.position = c("right", "top")) 

Here, we used shiny with the leaflet package to visualise all the different variables on an interactive map. This visualisation allows us to better distinguish each UHF individually by zooming in and having the area number appear when moving the mouse.

3.5 Comparison of Top 5 most & Top 5 least polluted UHF areas in 2021 against PM2.5

# Create dataframe to link factors with pm2.5 values for 2017 as reference point
# start by renaming GeoID from factors to UHFCODE
colnames(Factors)[1] = "UHFCODE"
testpm2017 <- filter(dfpm2.5, Time == 2017)
head(testpm2017)
#>   Time UHFCODE                Geography Mean.mcg.m3
#> 1 2017     101  Kingsbridge - Riverdale        7.41
#> 2 2017     102          Northeast Bronx        7.46
#> 3 2017     103       Fordham - Bronx Pk        7.52
#> 4 2017     104     Pelham - Throgs Neck        7.52
#> 5 2017     105         Crotona -Tremont        8.16
#> 6 2017     106 High Bridge - Morrisania        8.16
pm2017 <- dplyr::select(testpm2017,UHFCODE,Mean.mcg.m3)
testFactors_and_PM <- merge(pm2017,Factors, by='UHFCODE')

# Remove Geography column
Factors_and_PM <- testFactors_and_PM[,-3]

The below graph illustrates how the five most polluted UHF areas of NYC differ in regards to the observed variables. Through this we aim to see if there are similiraties between these areas in order to have a glimpse of air pollution drivers. There is great dispersion for Vegetative percent and Walking distance to Subways. Traffic density and Bicpercent are rather grouped. All five UHF areas have high Vegetative cover percent, and this may be a predictor for PM2.5. We also see Subway Density is almost the same value for every area, leading us to believe this will not impact air pollution.

# Top5
TOPI <- Factors_and_PM_Income_Level[(order(Factors_and_PM_Income_Level$Mean.mcg.m3, decreasing = FALSE)),]

valuez <- table(TOPI$Mean.mcg.m3[1:5])

Top5 <- filter(Factors_and_PM_Income_Level,Mean.mcg.m3 %in% c(6.32, 6.58,6.75,6.8,6.83)) 

Top5 <- Top5[,-11]

# Pivot Longer
TOP5_longer <- Top5 %>%
  pivot_longer(c("Mean.mcg.m3","VegePercent", "WalkSubDist", "BicPercent", "NeighPercent", "SubDensity","TrafDensity"), names_to = "Borough", values_to = "Values")

TOP5_longer$UHFCODE <- as.character(TOP5_longer$UHFCODE)

# plot
library(ggiraph)
top <- ggplot(TOP5_longer, aes(x=Values, y=Borough,tooltip = UHFCODE)) +
  geom_segment( aes(x=0, xend=Values, y=Borough, yend=Borough)) +
  geom_point_interactive(color="Blue", fill=alpha("black", 0.3), shape=21, stroke=2)+
  ggtitle("Five least polluted UHF areas")+
  theme(axis.text=element_text(size=15))

girafe(ggobj = top)


The below chart shows the common factors of the five most polluted UHF areas. First, we see that the level of Vegetative Percent is strikingly low compared to the top five least polluted areas. Secondly, their traffic density is significantly higher. Finally, the BicPercent is far higher. These seem to be characteristics of a highly busy part of a city.

# Flop5
FLOPI <- Factors_and_PM_Income_Level[(order(Factors_and_PM_Income_Level$Mean.mcg.m3, decreasing = TRUE)),]

valuezz <- table(FLOPI$Mean.mcg.m3[1:5])

Flop5 <- filter(Factors_and_PM_Income_Level,Mean.mcg.m3 %in% c(8.98, 9.01,9.02,9.91,10.33)) 

Flop5 <- Flop5[,-c(9,10,11)]

# Pivot longer
FLOP5_longer <- Flop5 %>%
  pivot_longer(c("Mean.mcg.m3","VegePercent", "WalkSubDist", "BicPercent", "NeighPercent", "SubDensity","TrafDensity"), names_to = "Borough", values_to = "Values")

# plot
library(ggiraph)
flop <- ggplot(FLOP5_longer, aes(x=Values, y=Borough,tooltip = UHFCODE)) +
  geom_segment(aes(x=0, xend=Values, y=Borough, yend=Borough)) +
  geom_point_interactive(color="Blue", fill=alpha("black", 0.3), alpha=0.7, shape=21, stroke=2)+
  ggtitle("Five most polluted UHF areas")+
  theme(axis.text=element_text(size=15))

girafe(ggobj = flop)

4. Analysis

Question 1

How are the thresholds of Air Pollution respected in NYC? In what Proportion?

To answer this question, we have chosen one main threshold of PM2.5 levels given by the USEPA (United States Environmental Protection Agency). This institution has two levels of air pollution to respect : Primary & Secondary. The former is a more strict measure chosen to protect asthmatic people, children & the elderly, while the latter is a general guide providing a safe quality of air for the average Joe. More precisely, these two levels are respectively 12 & 15 mcg/m^3 (micrograms per cubic meter of air). We also decided to compare these American levels with other institutions tracking Air Pollution and found the WHO (World Health Organization) has a far stricter threshold of 5 mcg/m^3.

To calculate the proportion in which these levels are respected or not, we divide the observations by the tresholds and multiply them by 100. This indicates the percentage of drawdown or extension from the respective levels of Air Pollution.

# Analysis Question 1 : Add columns of USEPA & WHO PM2.5 thresholds.
dfpm2.5_index <- dfpm2.5 %>%
  mutate(dfpm2.5,"USEPA_Primary %" = ((Mean.mcg.m3 / 12)-1)*100, "USEPA_Secondary %" = ((Mean.mcg.m3 / 15)-1)*100, "WHO %" = ((Mean.mcg.m3 / 5)-1)*100)

uhfpm_index <- merge(UHFmap, dfpm2.5_index, by = "UHFCODE", duplicateGeoms = TRUE)

The following visualisation shows us the evolution of Air Pollution in regards to the Primary USEPA level(12mcg/m^3), from 2010 to 2021. Ideally, we would like all areas to be green. Anything yellow or orange means the UHF area in question is not respecting the Primary USEPA Air Pollution level. In 2010, we can see that almost all UHF areas are green except for the center of Manhattan. Although is started badly, the same areas air quality respected the levels as of 2015. This means, that until 2015, it wasn’t considered safe to breath the air of Midtown Manhattan if you were asthmatic, a child or an elderly person. In 2021, all UHF areas show a drawdown from 12 mcg/m^3, meaning the air of the entire city of New York is now safe to breath.

## USEPA PRIMARY GUIDELINES
tm_shape(uhfpm_index) +
  tm_polygons(
    col = "USEPA_Primary %",
    n=10,
    palette ="-RdYlGn")+
  tm_facets(by = "Time", ncol = 3) +
  tm_layout(legend.outside.size = 0.2)

The following visualisation shows us how the UHF areas respect the Secondary USEPA level : 15mcg/m^3. Between 2010 and 2021, all the boroughs respected this level and are considered safe for the average life form. However, certain areas came very close to this level, and we decided to use darker red colours to illustrate this. For example, the UHF areas which overshot the Primary USEPA level, were only 5% below this secondary USEPA level, almost making them dangerous for all humans in the area. This is quite alarming given the importance and popularity of this part of the city. As expected after the previous illustration, all UHF areas are now safe in regards to the Secondary USEPA level.

## USEPA SECONDARY GUIDELINES
tm_shape(uhfpm_index) +
  tm_polygons(
    col = "USEPA_Secondary %",
    n=10,
    palette ="-RdYlGn")+
  tm_facets(by = "Time", ncol = 3) +
  tm_layout(legend.outside.size = 0.2)

The below visualisation shows in what percentage the NYC air pollution levels extended past WHO guidelines. Given that NYC barely respected the Primary USEPA levels, it’s quite obvious there will be struggle with this far stricter measure : 5mcg/m^3. Straight away, we see that not a single UHF area since 2010 respects the WHO guidelines. In 2010, the center of Manhattan recorded air pollution levels almost three times greater than what would have been necessary to be considered safe in the eyes of the WHO. Throughout the 2010s, we see slow improvements. In 2020, we see that there are important parts of Staten Island, Brooklyn and the Bronx which were in line with the WHO and this is certainly due to COVID-lockdowns having halted flights and many forms of emission-creating transport.

## WHO GUIDELINES 
tm_shape(uhfpm_index) +
  tm_polygons(
    col = "WHO %",
    breaks = c(0,20,40,60, 80, 100, 120, 140, 160, 180, 200),
    palette ="-RdYlGn")+
  tm_facets(by = "Time", ncol = 3) +
  tm_layout(legend.outside.size = 0.2)

Now, we would like to compare the evolution of PM2.5 levels across the five boroughs and use visualisations which may speak to us more. To begin, we gather all the necessary data and calculate the AveragePM2.5 level per year, per borough. To do this, we calculate the mean of all UHF areas in their respective Boroughs. We do this for every borough, merge all the data together into one table, of which we’ve illustrated the first ten lines below.

MeanPM2.5_Bronx <- dplyr::select(dfpm2.5, UHFCODE, Mean.mcg.m3, Time) %>% 
  filter(UHFCODE %in% c(101:107) & Time %in% c(2010:2021)) %>%
  group_by(Time) %>%
  summarise("Bronx" = mean(Mean.mcg.m3[Time %in% c(2010:2021)]))

MeanPM2.5_Brooklyn <- dplyr::select(dfpm2.5, UHFCODE, Mean.mcg.m3, Time) %>% 
  filter(UHFCODE %in% c(201:211) & Time %in% c(2010:2021)) %>%
  group_by(Time) %>%
  summarise("Brooklyn" = mean(Mean.mcg.m3[Time %in% c(2010:2021)]))

MeanPM2.5_Manhattan <- dplyr::select(dfpm2.5, UHFCODE, Mean.mcg.m3, Time) %>% 
  filter(UHFCODE %in% c(301:310) & Time %in% c(2010:2021)) %>%
  group_by(Time) %>%
  summarise("Manhattan" = mean(Mean.mcg.m3[Time %in% c(2010:2021)]))

MeanPM2.5_Queens <- dplyr::select(dfpm2.5, UHFCODE, Mean.mcg.m3, Time) %>% 
  filter(UHFCODE %in% c(401:410) & Time %in% c(2010:2021)) %>%
  group_by(Time) %>%
  summarise("Queens" = mean(Mean.mcg.m3[Time %in% c(2010:2021)]))

MeanPM2.5_StatenIsland <- dplyr::select(dfpm2.5, UHFCODE, Mean.mcg.m3, Time) %>% 
  filter(UHFCODE %in% c(501:504) & Time %in% c(2010:2021)) %>%
  group_by(Time) %>%
  summarise("Staten Island" = mean(Mean.mcg.m3[Time %in% c(2010:2021)]))

MeanPM2.5PerBorough <- merge(merge(merge(merge(MeanPM2.5_Bronx, MeanPM2.5_Brooklyn), MeanPM2.5_Manhattan), MeanPM2.5_Queens), MeanPM2.5_StatenIsland)

# Displaying first 5 rows in a table
kable((MeanPM2.5PerBorough[1:10,]),"html") %>%
  kable_styling(full_width = F)
Time Bronx Brooklyn Manhattan Queens Staten Island
2010 10.31 9.81 11.66 9.21 8.84
2011 10.93 10.34 11.99 9.77 9.50
2012 9.65 9.15 10.98 8.64 8.38
2013 9.22 8.82 10.80 8.38 8.05
2014 9.37 9.23 10.95 8.62 8.50
2015 9.32 8.91 10.25 8.37 7.87
2016 7.82 7.93 9.07 7.20 6.95
2017 7.78 7.66 8.78 7.21 6.91
2018 7.31 7.33 8.52 6.84 6.49
2019 6.96 6.82 8.37 6.49 6.00

Next, we use the pivot_longer function to facilitate the task of creating our visualisations with ggplot. Now, our table uses Time, Borough & Average PM2.5 columns. Below, we see the first ten lines of this table.

MeanPM2.5PerBorough_Longer <- MeanPM2.5PerBorough %>%
  pivot_longer(c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"), names_to = "Borough", values_to = "Average PM2.5")

# Displaying first 5 rows in a table
kable((MeanPM2.5PerBorough_Longer[1:5,]),"html") %>%
  kable_styling(full_width = F)
Time Borough Average PM2.5
2010 Bronx 10.31
2010 Brooklyn 9.81
2010 Manhattan 11.66
2010 Queens 9.21
2010 Staten Island 8.84

These data manipulations help us to better visualize the evolution of PM2.5 across the city. Instead of seeing color evolution on a map, which could lead to inprecise analysis, we have used various charts. First, we have illustrated PM2.5 evolution per borough on a stacked histogram. Here, we see a rather steep decline in PM2.5 levels in the Bronx, Brooklyn and Manhattan. The decrease is a bit slower in Queens & Staten Island, but these areas were less affected by air pollution to begin with.

ggplot(MeanPM2.5PerBorough_Longer, aes(fill=Borough, y = `Average PM2.5`, x = Time)) + 
  geom_bar(position="stack", stat = "identity") +
  theme(axis.text=element_text(size=15))

In this second visualization, we aim to see which borough has reduced its air pollution levels the most. To calculate this, we simply divide the 2021 PM2.5 figure by the 2010 PM2.5 figure, and convert it to a percentage. Then, we use ggplot to plot the PM2.5_Evolution column and represent the values by descending order. The chart shows that Manhattan has had the greatest decrease in air pollution. As shown in previous visualisations, it was the worst in 2010, so this is positive news.

MeanPM2.5PerBorough21 <- filter(MeanPM2.5PerBorough_Longer, Time == 2021)
MeanPN2.5PerBourough10 <- filter(MeanPM2.5PerBorough_Longer, Time == 2010)

PM2.5_Evolution_Since2010 <-MeanPM2.5PerBorough21 %>%
  mutate("PM2.5_Evolution" = ((`Average PM2.5` / MeanPN2.5PerBourough10$`Average PM2.5`)-1)*100)

PM2.5_Evolution_Since2010
#> # A tibble: 5 x 4
#>    Time Borough       `Average PM2.5` PM2.5_Evolution
#>   <dbl> <chr>                   <dbl>           <dbl>
#> 1  2021 Bronx                    6.66           -35.4
#> 2  2021 Brooklyn                 6.73           -31.4
#> 3  2021 Manhattan                7.45           -36.2
#> 4  2021 Queens                   6.48           -29.7
#> 5  2021 Staten Island            6.21           -29.8

ggplot(data=PM2.5_Evolution_Since2010, aes(x = reorder(Borough, -PM2.5_Evolution), y=PM2.5_Evolution)) +
  geom_bar(stat="identity", fill="steelblue")+
  theme_minimal()+
  xlab("Borough")+ 
  ylab("PM 2.5 difference")+
  ggtitle("Difference of PM2.5: 2010 vs 2021 per Borough")+ 
  theme(plot.title = element_text(color = "black", size = 14, face = "bold"))

In our third and final chart for question 1, we have used the infamous ggplot pie chart to gain a better insight into how each borough competes with one another in 2021. For a borough to boast a good position on the PM2.5 leaderboard, they should have the smallest slice of the following pie chart. Overall, the differences between the boroughs are not striking, and this is why we integrated the values into the chart. Thanks to these, we clearly see how Manhattan still has the worst Air Pollution levels, while Staten Island has the lowest.

PM2.5_Evolution_Since2010 <- PM2.5_Evolution_Since2010 %>% 
  arrange(desc(Borough)) %>%
  mutate(prop = PM2.5_Evolution_Since2010$`Average PM2.5` / sum(PM2.5_Evolution_Since2010$`Average PM2.5`) *100) %>%
  mutate(ypos = cumsum(prop)- 0.5*prop )

PM2.5_Evolution_Since2010$`Average PM2.5`<- 
  as.character(round(PM2.5_Evolution_Since2010$`Average PM2.5`, 2))

ggplot(PM2.5_Evolution_Since2010, aes(x="", y=prop, fill=Borough))+
  geom_bar(stat = "identity", width = 10)+
  coord_polar("y", start=0) +
  theme_void() + 
  #theme(legend.position="none") +
  geom_text(aes(y = ypos, label = PM2.5_Evolution_Since2010$`Average PM2.5`), color = "white", size=4) +
  scale_fill_brewer(palette="Set1") 

In conclusion, the analysis of PM2.5 levels between 2010 and 2021 has clearly shown us great improvements in air quality. While the situation did not get off to a great start in busier UHF areas of Midtown Manhattan, the entire city is now under both USEPA thresholds for safe air pollution levels. However, depending on which Air Quality guideline you choose, USEPA or WHO, the story is not the same.

Question 2

What are the main characteristics of an area contributing to Air Pollution? Do the Boroughs have distinct characteristics?

For question 2, we shall use our Factors table to determine what drives Air Pollution in UHF areas. Our hypothesis is that the three main predictors of Air Pollution levels are : Vegetative cover, Traffic density & Population density. We think Vegetative Percent will be negatively correlated with other variables linked to the use of city space, such as traffic density needing more space for roads instead of green areas, or high population density forcing construction of habitable buildings instead of parks. We expect Traffic density to contribute directly to Air Pollution through emissions. In a similar way, Population density should play an important role as the more people are present in an area, the more the activity is concentrated in a given space.

In order to check our hypotheses and intuitions, we will start our analysis by plotting the relationships between the factors. From Factors, we must remove the Geography column as it is characters, and we remove PM2.5 as well. We use the corrplot package to run our correlation tables. Our first table shows us circles, varying positively in size when the correlation is greater. We can also directly demonstrate the correlation in values, as you can see in the second correlation matrix. First, we can observe Vegetative Percent has a strong negative relation with all other variables. Secondly, Bicycle percent has a positive relationship with all factors except vegetative cover.

# Analysis of Factors without PM2.5
# Load corrplot package to run matrix of correlation
library(corrplot)

# We need to remove geography column as it is chr, and we remove PM2.5 as well
# Visualization
corrplot(cor(Factors[,-c(1,2)]), order = 'FPC', type = 'lower', diag = FALSE)

# With numerical values
corrplot(cor(Factors[,-c(1,2)]), order = 'FPC', method = 'number', type = 'lower', diag = FALSE, number.cex=0.75)

We create a dataframe to link factors with PM2.5 values from 2017 as a reference point. We rename the GeoID column to UHFCODE and remove the Geography column.

# Create dataframe to link factors with pm2.5 values for 2017 as reference point
# start by renaming GeoID from factors to UHFCODE
colnames(Factors)[1] = "UHFCODE"
testpm2017 <- filter(dfpm2.5, Time == 2017)
head(testpm2017)
#>   Time UHFCODE                Geography Mean.mcg.m3
#> 1 2017     101  Kingsbridge - Riverdale        7.41
#> 2 2017     102          Northeast Bronx        7.46
#> 3 2017     103       Fordham - Bronx Pk        7.52
#> 4 2017     104     Pelham - Throgs Neck        7.52
#> 5 2017     105         Crotona -Tremont        8.16
#> 6 2017     106 High Bridge - Morrisania        8.16
pm2017 <- dplyr::select(testpm2017,UHFCODE,Mean.mcg.m3)
testFactors_and_PM <- merge(pm2017,Factors, by='UHFCODE')

# Remove Geography column
Factors_and_PM <- testFactors_and_PM[,-3]

In the following charts, we will use ggplot to demonstrate the behaviour between our main variables against PM2.5 levels.

For our first graph, we look at the link between air pollution and population density. We see that, the more crowded a UHF area is, the more pollution it has. The Rsquared of this regression is 36% and the variable is significant at the 5% Alpha level.

# PM2.5 against Pop Density
# Plot
ggplot(Factors_and_PM,aes(x=Popdensity, y=Mean.mcg.m3)) +
  labs(title = "The more crowded an UHF is, the more pollution it has", x =
         "Population Density (KM2)", y = "Mean PM2.5") +
  stat_summary(fun.Factors_and_PM=mean_cl_normal) + 
  geom_smooth(method='lm')

# Regression table
# We install the jtools package
library(jtools)
# create variable for linear regression
fit <- lm(Factors_and_PM$Mean.mcg.m3 ~ Factors_and_PM$Popdensity)
# show table
summ(fit,model.info= FALSE)
F(1,40) 22.42
0.36
Adj. R² 0.34
Est. S.E. t val. p
(Intercept) 7.03 0.19 36.80 0.00
Factors_and_PM$Popdensity 0.00 0.00 4.73 0.00
Standard errors: OLS

Let’s explore the link between traffic density and pollution levels. On the following graph, we can see how more traffic is correlated with a higher PM2.5 level. The Rsquared is slightly higher in this case, 53% and still very significant at Alpha = 5%.

# PM2.5 against Traffic Density
# Plot
ggplot(Factors_and_PM,aes(x=TrafDensity, y=Mean.mcg.m3)) +
  labs(title = "More traffic is also equivalent to increased levels in air pollution", x =
         "Traffic Density", y = "Mean PM2.5") +
  stat_summary(fun.Factors_and_PM=mean_cl_normal) + 
  geom_smooth(method='lm')

# Regression Table
fittrafdensity <- lm(Factors_and_PM$Mean.mcg.m3 ~ Factors_and_PM$TrafDensity)
summ(fittrafdensity,model.info= FALSE)
F(1,40) 44.74
0.53
Adj. R² 0.52
Est. S.E. t val. p
(Intercept) 6.45 0.22 29.65 0.00
Factors_and_PM$TrafDensity 0.04 0.01 6.69 0.00
Standard errors: OLS

On this final graph, we look at the relationship between Vegetative Percent and air pollution. We see a strong negative relationship visually. An RSquared of 51% proves this and the variable is significant.

# PM2.5 against Vegetative Coverage
ggplot(Factors_and_PM,aes(x=VegePercent, y=Mean.mcg.m3)) +
  labs(title = "Greener areas tend to be less polluted", x =
         "Vegetative P%", y = "Mean PM2.5") +
  stat_summary(fun.Factors_and_PM=mean_cl_normal) + 
  geom_smooth(method='lm')

fitvegepercent <- lm(Factors_and_PM$Mean.mcg.m3 ~ Factors_and_PM$VegePercent)

summ(fitvegepercent,model.info= FALSE)
F(1,40) 40.91
0.51
Adj. R² 0.49
Est. S.E. t val. p
(Intercept) 9.27 0.25 36.47 0.00
Factors_and_PM$VegePercent -0.05 0.01 -6.40 0.00
Standard errors: OLS

Now that we have confirmed some of our intuitions we are going to see which variables are the most significant for our model. To do so, we will use a backward stepwise regression. We start by defining our model with all factor variables, and we eliminate the one with the less explanatory power.

However, as we’ve seen in the correlation matrix, there seems to be variables with rather high correlations. For example, neighborhood poverty percent and Median Income are strongly correlated with each other. Therefore, we shall start by checking our variables for multicollinearity. To do so, we first create a model with all the factors and use the vif() function.

modelfactors <- lm(Mean.mcg.m3 ~ VegePercent + WalkSubDist + BicPercent + NeighPercent + SubDensity + TrafDensity + MedianIncome + Popdensity, data = Factors_and_PM)

library(car)

#create vector of VIF values
VIFvalues <- vif(modelfactors)

#create horizontal bar chart to display each VIF value
barplot(VIFvalues, main = "VIF Values", las=1, cex.names=0.55, horiz = TRUE, col = "steelblue")

#add vertical line at 5 as after 5 there is severe correlation
abline(v = 5, lwd = 3, lty = 2)

#summ(modelfactors)
#export_summs(modelfactors)

library(sjPlot)
library(sjmisc)
library(sjlabelled)

tab_model(modelfactors, title = "Linear regression of PM2.5 against all factors")
Linear regression of PM2.5 against all factors
  Mean.mcg.m3
Predictors Estimates CI p
(Intercept) 6.27 4.93 – 7.60 <0.001
VegePercent -0.02 -0.04 – -0.01 0.002
WalkSubDist -0.00 -0.01 – 0.01 0.466
BicPercent 0.00 -0.02 – 0.03 0.762
NeighPercent 0.04 0.00 – 0.07 0.025
SubDensity 0.09 -0.14 – 0.31 0.438
TrafDensity 0.02 0.01 – 0.04 0.002
MedianIncome 0.00 0.00 – 0.00 0.022
Popdensity 0.00 -0.00 – 0.00 0.839
Observations 42
R2 / R2 adjusted 0.811 / 0.765

As we can see on this first VIF Values chart, we have high chances of multicollinearity with the Median Income variable, as its value is greater than 5. Therefore, we choose to remove it from our model.

We conduct the same VIF test as prior.

modelfactors2 <- lm(Mean.mcg.m3 ~ VegePercent + WalkSubDist + BicPercent + NeighPercent + SubDensity + TrafDensity + Popdensity, data= Factors_and_PM)

library(car)

#create vector of VIF values
VIFvalues2 <- vif(modelfactors2)

barplot(VIFvalues2, main = "VIF Values", horiz = TRUE, col = "steelblue", cex.names = 0.55, las=1) #create horizontal bar chart to display each VIF value

abline(v = 5, lwd = 3, lty = 2)  #add vertical line at 5 as after 5 there is severe correlation

Now, all our variables have VIF coefficients below 5. However, BicyclePercent’s VIF is close to 5. We still decide to keep it for our research.

We shall now apply backward integration on the remaining variables, in order to see which ones are the most relevant.

# We do a backward regression
library(MASS)
backward.model <- step(modelfactors2, direction = "backward",trace=0)
finalmodel <- lm(Mean.mcg.m3 ~ VegePercent +  SubDensity + TrafDensity, data = Factors_and_PM)
library(sjPlot)
library(sjmisc)
library(sjlabelled)

tab_model(finalmodel, title = "Final model")
Final model
  Mean.mcg.m3
Predictors Estimates CI p
(Intercept) 7.47 6.81 – 8.13 <0.001
VegePercent -0.02 -0.04 – -0.01 0.002
SubDensity 0.22 0.08 – 0.36 0.004
TrafDensity 0.02 0.01 – 0.04 <0.001
Observations 42
R2 / R2 adjusted 0.769 / 0.751

We see that the variables which have been kept are Vegetative Percent, Subway density & Traffic density. Our intuitions were correct, apart from Population density being replaced by Subway density.

We can analyze the residuals of our model, and see whether or not they respect the 3 following assumptions: 1) they follow a Normal distribution 2) they are statistically independent from one another 3) They have equal variances when compared to each values of our sample

We start by plotting the residuals, but it is hard to tell if they are normally distributed.

# We start with a plot
hist(backward.model$residuals, main= "Histogram of Residuals", xlab = "Residuals of backward model")

We check this with a qqplot from the car package, and they appear to follow a normal distribution.

# Creating the qqplot
qqPlot(backward.model$residuals, distribution = "norm")
#> [1]  8 25

Finally, we can do a Jarque-Bera test to confirm this, thanks to the tseries package. As the p-value is 0.4, we cannot reject our null hypothesis and we confirm that it is likely following a normal distribution.

library(tseries)
jbtmodel <- jarque.bera.test(backward.model$residuals)

Next, we plot our fitted values against our residuals. The residuals don’t seem to follow a particular pattern and shows that they might not be correlated between each other.

plot(backward.model$fitted.values, backward.model$residuals, xlab="Fitted values", ylab="Residuals")

We conduct the ANOVA test. We do not reject the hypothesis that Neighborhood Poverty in % has a different mean than the other variable. It appears that all the variables have different variances.

# Anova test
anovamodel <- anova(backward.model)

We run our last VIF test for our new model and we see that all the values are quite low.

# We do not reject the hypothesis that Neighborhood % has a different mean than the other variable
VIFnewmodel <- vif(backward.model)

barplot(VIFnewmodel, main = "VIF Values", horiz = TRUE, col = "steelblue", cex.names = 0.55, las=1) #create horizontal bar chart to display each VIF value

abline(v = 5, lwd = 3, lty = 2)  #add vertical line at 5 as after 5 there is severe correlation

Now that we know which variables from our factors are the most predictive of air pollution levels, we can analyse the influence on the various UHF areas. First of all, we thought to check if the characteristics of each borough are distinguishable with each other, or if they can be considered homogeneous. To do so, we start by clustering all the UHF areas with all their characteristics.

# CLustering
Factors_and_PM$UHFCODE <- as.character(Factors_and_PM$UHFCODE)

distmat <- dist(Factors_and_PM[,-1], method="euclidean", diag=TRUE,upper=TRUE)
coclust <- hclust(distmat, method="complete") 
plot(coclust, labels=Factors_and_PM[,1]) 
rect.hclust(coclust, k=5, border="red")

By using the euclidean method for our distance matrix, we return the above dendogram for five clusters. As we can see, it’s hard to establish a signifcant difference between the boroughs. If it were the case, each cluster would have values starting with the same number. We can see some resemblance between clusters : in the first cluster there are four values from the Bronx, and four values of Manhattan in the last. Other than these two, they are still quite mixed. It is interesting to point out how UHF 310 (Union Square - Lower Manhattan) is alone in the fourth cluster.

data2.sc <- scale(Factors_and_PM[,-1]) 
row.names(data2.sc) <- Factors_and_PM[,1]

# Load factorextra() library
library(factoextra)
fviz_nbclust(data2.sc, kmeans, method="wss")

Using the elbow method, we determine an optimal number of clusters. Here, as the rate decline slows drastically at 3 clusters, we choose this value.

ct3 <- cutree(coclust, k=3) 
plot(coclust, labels=Factors_and_PM[,1]) 
rect.hclust(coclust, k=3, border="red")

However, when plotting the three clusters, the results are not convincing. We can now conclude there are no significant differences between the UHF areas. To further this, we create three visualizations for each of our three variables. They show us that the UHF values are quite similar, but they are still rather mixed. The clear differences are in the extremes, for example between Staten Island & Manhattan.

We can see this trend more clearly by plotting our variables with each boroughs. For Vegetative cover, we see that some values related to Manhattan or Brooklyn are somewhat close together. However, they also blend with other UHF codes from other Boroughs.

# Using ggrepel to see visually the boroughs
# Creating dataframe: UHF CODE with Borough

UHFCODE <- c(c(101:107),c(201:211),c(301:310),c(401:410),c(501:504))

Bronx <- "Bronx"
Brooklyn <- "Brooklyn"
Manhattan <- "Manhattan"
Queens <- "Queens"
StatenIsland <- "Staten Island"

Borough <- c(rep(Bronx,7),rep(Brooklyn,11),rep(Manhattan,10),rep(Queens,10),rep(StatenIsland,4))

UHF_Borough <- cbind(UHFCODE,Borough)
UHF_Borough <- as.data.frame(UHF_Borough)
UHF_Borough$UHFCODE <- as.character(UHF_Borough$UHFCODE)

Factors <- (list(dfvegetative_cover[,-1],dfwalk_dist_subway[,c(2,4)],dfbicycle_network[,c(2,4)],dfNeighborhood_Poverty[,c(1,3)], dfsubway_density[,c(2,4)], dftraffic_density[,c(2,4)],dfincome[,c(1,2)],dfpop_density[,c(1,4)]) %>% reduce(inner_join, by='GeoID'))

Factors_and_PM2 <- Factors_and_PM %>% 
  left_join(UHF_Borough, by="UHFCODE")

# Graphical representation with Vegetative Percent
ggplot(Factors_and_PM2, aes(VegePercent,Mean.mcg.m3 )) +
  geom_point(aes(color = Borough)) +
  geom_point(size = 3, shape = 1, data = Factors_and_PM2) 

In the following graph, we can see how the extrem can more easily be separated. The values for manhattan reagrding traffic density, are way higher than those of the other boroughs.

# Traffic Density
ggplot(Factors_and_PM2, aes(TrafDensity,Mean.mcg.m3 )) +
  geom_point(aes(color = Borough)) +
  geom_point(size = 3, shape = 1, data = Factors_and_PM2)

The same trend can be observed when plotting for subway density.

# Subway Density
ggplot(Factors_and_PM2, aes(SubDensity,Mean.mcg.m3 )) +
  geom_point(aes(color = Borough)) +
  geom_point(size = 3, shape = 1, data = Factors_and_PM2)

To sum up, we can conclude that the three most important predictive variables for PM2.5 air pollution are Vegetative Percent, Traffic density & Subway density. Through manipulation of model choice and various statistical techniques, we’ve also found that there are no significant differences between the boroughs in regards to all their characteristics.

Question 3

Are there disparities in levels of Air Pollution between wealthier and poorer areas?

As New York City has its fair share of socio-economic inequalities, we wanted to analyse if less fortunate UHF areas have worse levels of Air Pollution, and if richer UHF areas are better off in regards to breathable air quality. For this, we have created three groups of Median Income to segment the areas: Low($28’300 to $79’500), Medium($79’500 to $131’000) & High($131’000 to $182’000).

# Groups income into 3 levels

interval_income <- cut_interval(Factors_and_PM$MedianIncome,3)
test <- table(Factors_and_PM$UHFCODE, interval_income)

Factors_and_PM_Income_Level <- Factors_and_PM %>%
  mutate(cut_interval(Factors_and_PM$MedianIncome,3)) 

colnames(Factors_and_PM_Income_Level)[11] = "Category"
Factors_and_PM_Income_Level$Category <- as.character(Factors_and_PM_Income_Level$Category)

Factors_and_PM_Income_Level[Factors_and_PM_Income_Level == "[2.83e+04,7.95e+04]"] <- "Low"
Factors_and_PM_Income_Level[Factors_and_PM_Income_Level == "(7.95e+04,1.31e+05]"] <- "Medium"
Factors_and_PM_Income_Level[Factors_and_PM_Income_Level == "(1.31e+05,1.82e+05]"] <- "High"

We can see that the UHF areas belonging to the Low Income group have the lowest level of median PM2.5 air pollution level. The dispersion in the Medium Income group is rather stretched, meaning the UHF areas have varying levels of air pollution levels and many values in the upper quartile. The high income group has the highest median level of air pollution with a squeezed dispersion, showing us the data is more homogeneous.

# Boxplot 
ggplot(data=Factors_and_PM_Income_Level, mapping=aes(x= Category, y=Mean.mcg.m3))+
  geom_boxplot()+ggtitle("Relationship between income groups & PM2.5 levels")

The violon boxplot below gives us a better indication on the income dispersion.

library(ggplot2)
library(dplyr)
library(hrbrthemes)
library(viridis)

ggplot(Factors_and_PM_Income_Level, aes(x=Category, y=Mean.mcg.m3, fill=Category)) +
    geom_violin(width=1.5) +
    geom_boxplot(width=0.2, color="grey", alpha=0.2) +
    scale_fill_viridis(discrete = TRUE) +
    theme_ipsum() +
    theme(text = element_text(size=20),
      legend.position="none",
      plot.title = element_text(size=15)
    ) +
    ggtitle("Relationship between income groups & PM2.5 levels") +
    xlab("")

Next, we compare Neighborhood Poverty percent and median income against average PM2.5 levels

For Neighborhood Poverty, there is great dispersion in the observations. Indeed, the nature of this variable creates difficulty to interpret the graph. The fact that a given UHF area has a low poverty rate, does not necessarily imply that it can be considered to have a financial advantage. Therefore, we choose to focus more on the Median Income levels to answer our research question. We can see a clear upward trend showing that richer areas tend to be more polluted than the others.

#Compare poverty variable & other variables
Factors_and_PM_Income_Level %>%
  filter(NeighPercent >= 5, NeighPercent <= 39) %>%
  ggplot(aes(NeighPercent, Mean.mcg.m3)) +
    geom_point() + geom_smooth()
 
Factors_and_PM_Income_Level %>%
  filter(MedianIncome >=2.83e+04 , MedianIncome <=1.82e+05) %>%
  ggplot(aes(MedianIncome, Mean.mcg.m3)) +
    geom_point() + geom_smooth()

By filtering our UHF’s based on their income, we wanted to compare the richest UHF (305: Union Square - Lower Manhattan) with the poorest (107: South Bronx). By using our common factors, we can display their main differences.

On the graph below, we used all of our variables. However, even by applying a logarithmic transformation, the highs values of median income and population density were making it hard to compare the other factors. We can note that these 2 factors are higher for Union Square.

options(scipen=999)

Factors_and_PM$UHFCODE <- as.character(Factors_and_PM$UHFCODE)
new <- filter(Factors_and_PM,UHFCODE %in% c("107","305")) 

new2 <- new[,-11]

new_longer <- new2 %>%
  pivot_longer(c("Mean.mcg.m3","VegePercent", "WalkSubDist", "BicPercent", "NeighPercent", "SubDensity","TrafDensity","MedianIncome","Popdensity"), names_to = "Borough", values_to = "Values")

new_longer %>%
   ggplot(aes(x = Borough, y=Values, col = UHFCODE, group = UHFCODE)) +
   geom_line() + scale_y_log10()+geom_point()+
  ggtitle(label = "Comparison of richest vs poorest UHF area") +
  theme(axis.text=element_text(size=9))

By removing Median income and population density we get the following graph:

options(scipen=999)

Factors_and_PM_Income_Level$UHFCODE <- as.character(Factors_and_PM_Income_Level$UHFCODE)

new <- filter(Factors_and_PM_Income_Level,UHFCODE %in% c("107","305")) 

new2 <- new[,-11]

new_longer <- new2 %>%
  pivot_longer(c("Mean.mcg.m3","VegePercent", "WalkSubDist", "BicPercent","TrafDensity", "SubDensity"), names_to = "Borough", values_to = "Values")

new_longer %>%
   ggplot(aes(x = Borough, y=Values, col = UHFCODE, group = UHFCODE)) +
   geom_line() + geom_point()+
  ggtitle(label = "Comparison of richest vs poorest UHF area") +
  theme(axis.text=element_text(size=9))

We can see more clearly that Union Square has a much higher traffic density than South Bronx. It also has a more prominent Bicycle network. The amount of PM2.5 is marginally higher in Union Square, but the difference is not significant.

By plotting the trends of PM2.5 over the past decade, we find that in 2010, the disparity between the air quality of the richest and poorest UHF areas was greater, but by 2021, the gap had narrowed considerably. Hence, it can be concluded that living in a high or low socioeconomic area today, will not drastically affect the quality of breathable air.

library(purrr)
library(ggplot2)
library(gganimate)

richandpoor <-dfpm2.5 %>%
  dplyr::filter(dfpm2.5$UHFCODE %in% c(107,305))

richandpoor$UHFCODE <- as.character(richandpoor$UHFCODE)

richandpoor%>%
  ggplot(aes(x=Time, y=Mean.mcg.m3,group=UHFCODE, color = UHFCODE)) +
    geom_line() +
    geom_point() +
    ggtitle("Comparison of Pm2.5 evolution for the richest vs poorest UHF (2010-2021)")  +
    ylab("Mean mcg m3") + xlab("Time")+
    transition_reveal(Time)

5. Conclusion

5.1 Main Conclusions

  1. How are the thresholds of Air Pollution respected in NYC? In what Proportion?

To answer this question, we looked at the timeframe from 2010 until 2021. The data showed us that air pollution levels were quite high in 2010 and certain neighborhoods of Midtown Manhattan were considered a health threat, surpassing the USEPA Primary PM2.5 level. Our analysis and visualisations have shown that despite a bad start in 2010, air quality levels across New York City have decreased across the entire city. However, when comparing USEPA with WHO standards, we discovered the entire city is considered dangerous to breathe in. To us, this was quite surprising and we follow the opinion that WHO’s regulations are far too strict for now. If NYC continues the downward trend in air pollution, it may be able to comply with the WHO. 2020 was subject to far less air pollution given that there were almost no flights for an important part of the year due to COVID. In the future, it will be interesting if the levels are pursuing their decline in PM2.5 levels.

  1. What are the main characteristics of an area contributing to Air Pollution? Do the Boroughs have distinct characteristics?

Initially, our hypothesis was that only a few variables would be true predictors of PM2.5 levels in New York City. We saw that Vegetative Percent, Traffic density and subway density and the most important factors for predicting Air Pollution. Using clustering methods, we saw it was difficult to group the boroughs together. Some share similar characteristics, such as UHFs from the Bronx or Manhattan, but others are mixed with other Boroughs. Therefore, it is hard to conclude that they are truly different.

  1. Are there disparities in levels of Air Pollution between wealthier and poorer areas?

Our research has shown that the richest parts of New York City happen to be among the most populated and subject to the highest levels of Traffic density. In addition, we’ve seen that the rate of decrease in Air Pollution has been similar in low and high income UHF areas. Overall, it’s rather difficult to find any sort of link between wealth and air pollution. Therefore, we would tend to answer negatively to this question, which is actually positive in the grand scheme of social equality.

5.2 Limitations

  1. Missing time-values of certain variables : we would have found interesting to explore the temporal aspect of our variables to truly determine why air pollution levels have decreased across the city.
  2. Spatial data : we would have like to have used Zip Codes rather than UHF areas in order to have more observations which would increase the precision of our statistical analysis.
  3. Emissions data : Drawing on Population & Traffic density, it would have been great to get actual figures about emissions in the city. For example, a csv file of emissions of every building which we would have regrouped by UHF area.

5.3 Group Take

Throughout the semester, this project taught us vital lessons in the use of R code and how to construct a real Data Science Analysis. While it was sometimes tough to find errors and deal with R, we would like to thank the assistants for being there every Wednesday morning to help us out. This course was definitely a pleasure and we shall recommend it to next year’s MScM students.

Many thanks for your attention and we wish you all the best, Julian, Charlène & Henri